GROUP NO - 38
Names of the group members:-
PRANAV JADHAV (0774109)
SAHAJ PATEL (0774578)
GATI SONANI (0779956)
HARSH TRIVEDI (0788765)
PUNAM DESAI (0785752)
URVASHI PRAJAPATI (0785750)
Our Project represents our own work and we have adhered to St. Clair College’s Academic Integrity policies in completing this project.
## # A tibble: 2,930 x 34
## Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area BsmtFin_SF_1
## <dbl> <int> <int> <int> <dbl> <dbl>
## 1 141 31770 1960 1960 112 2
## 2 80 11622 1961 1961 0 6
## 3 81 14267 1958 1958 108 1
## 4 93 11160 1968 1968 0 1
## 5 74 13830 1997 1998 0 3
## 6 78 9978 1998 1998 20 3
## 7 41 4920 2001 2001 0 3
## 8 43 5005 1992 1992 0 1
## 9 39 5389 1995 1996 0 3
## 10 60 7500 1999 1999 0 7
## # ... with 2,920 more rows, and 28 more variables: BsmtFin_SF_2 <dbl>,
## # Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, First_Flr_SF <int>,
## # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
## # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
## # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
## # Fireplaces <int>, Garage_Cars <dbl>, Garage_Area <dbl>, Wood_Deck_SF <int>,
## # Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>,
## # Screen_Porch <int>, Pool_Area <int>, Misc_Val <int>, Mo_Sold <int>,
## # Year_Sold <int>, Sale_Price <int>, Longitude <dbl>, Latitude <dbl>
skim(ames_numeric)
| Name | ames_numeric |
| Number of rows | 2930 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| numeric | 34 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Lot_Frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
| Lot_Area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
| Year_Built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
| Year_Remod_Add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
| Mas_Vnr_Area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
| BsmtFin_SF_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
| BsmtFin_SF_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
| Bsmt_Unf_SF | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
| Total_Bsmt_SF | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
| First_Flr_SF | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
| Second_Flr_SF | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
| Gr_Liv_Area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| Bsmt_Full_Bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
| Bsmt_Half_Bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| Full_Bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
| Half_Bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
| Bedroom_AbvGr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
| Kitchen_AbvGr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
| TotRms_AbvGrd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
| Fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
| Garage_Cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
| Garage_Area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
| Wood_Deck_SF | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
| Open_Porch_SF | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
| Enclosed_Porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
| Three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
| Screen_Porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
| Pool_Area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
| Misc_Val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
| Mo_Sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
| Year_Sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
| Sale_Price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
| Longitude | 0 | 1 | -93.64 | 0.03 | -93.69 | -93.66 | -93.64 | -93.62 | -93.58 | ▅▅▇▆▁ |
| Latitude | 0 | 1 | 42.03 | 0.02 | 41.99 | 42.02 | 42.03 | 42.05 | 42.06 | ▂▂▇▇▇ |
glimpse(ames_numeric)
## Rows: 2,930
## Columns: 34
## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,~
## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005~
## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199~
## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199~
## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6~
## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, ~
## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0~
## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,~
## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, ~
## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, ~
## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,~
## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616~
## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, ~
## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, ~
## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, ~
## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, ~
## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,~
## $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, ~
## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, ~
## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4~
## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48~
## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0~
## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, ~
## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, ~
## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, ~
## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201~
## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213~
## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638~
## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4~
1)Year Sold
ggplot(ames_numeric,aes(Year_Sold)) + geom_bar(fill="sky blue", color="black",width = 0.5)+
labs(title="Year wise distribution of house sold",x="Year sold")
#Description
The plot shows the number of houses sold in each year. More than 600 houses were sold in most years, but in 2010 the count dropped to about 375.
a<-ggplot(ames_numeric,aes(Year_Built)) +geom_histogram(fill="orange", color="black", binwidth = 10)+
labs(title="Year wise distribution of house built",x="Year built") + coord_flip()
ggplotly(a)
#Description The plot shows the distribution of houses by year built. Most houses were built around 2000, and the fewest were built in or before 1900.
ggplot(ames_numeric) + geom_bar(mapping = aes(Garage_Cars),fill="orange", color="black")+
labs(title="Distribution of car's garage",x="Garage cars")
#Description The plot above shows garage capacity in cars. Most houses have a 2-car garage.
ggplot(ames_numeric)+ geom_histogram(mapping = aes(Sale_Price),fill="light green", color="white")+
labs(title="Distribution of sale price",x="Sale price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Description The plot shows the distribution of sale prices of the houses.
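The `stat_bin()` message above appears because no bin width was given. One common way to choose one is the Freedman-Diaconis rule, binwidth = 2 * IQR / n^(1/3). The sketch below applies it to the Sale_Price quartiles and row count reported in the skim() table. The variable names are ours, and the rule is only one reasonable choice, not the setting used for the plot above.

```r
# Quartiles and row count for Sale_Price, copied from the skim() output
p25 <- 129500
p75 <- 213500
n   <- 2930

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * (p75 - p25) / n^(1/3)
print(fd_binwidth)
```

The result is a little under 12,000, so a rounded value such as geom_histogram(binwidth = 10000) would be a reasonable starting point and would silence the message.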
5)Ground Living Area
# binwidth inside aes() is ignored, so it is dropped here and the default of 30 bins applies
abc <- ggplot(ames_numeric, aes(Gr_Liv_Area)) +
labs(title="Houses with their graded living areas",x="Graded living area")+
geom_histogram(fill="blue", color="red")
ggplotly(abc)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Description The plot shows the distribution of above-grade living area (Gr_Liv_Area). Most houses have between 900 and 1600 square feet of living area.
ggplot(ames_numeric) + geom_point(mapping=aes(Year_Built,Sale_Price))+ geom_smooth(mapping=aes(Year_Built,Sale_Price))+
labs(title="Sale price based on year built",x="Year built",y="Sale price")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#Description The plot shows sale price against year built, with a smoothed trend line. The trend is similar around 1900 and 1940, and it rises after 1940.
Boxplot <- ames_numeric %>% select(Year_Sold , Sale_Price) %>% round(6) %>% ggplot() + geom_boxplot(aes(Year_Sold , Sale_Price,group=Year_Sold, fill=Year_Sold), outlier.colour="red")+
labs(title="Sale price according to year sold",x="Year sold",y="Sale price")
Boxplot
#Description The box plot shows the distribution of sale prices for each year sold.
mh<- ggplot(ames_numeric)+
geom_point(mapping = aes(x=Gr_Liv_Area,y=Year_Built),color = "Green") +
labs(title="Graded living area based on year built",x="Graded living area",y="Year built")
ggplotly(mh)
#Description
The plot above shows year built against above-grade living area. Most houses have between 1000 and 2000 square feet of living area.
Ah<- ggplot(ames_numeric)+
geom_point(mapping = aes(x=Lot_Frontage,y=Sale_Price)) +
labs(title="Sale price according to lot frontage",x="Lot frontage",y="Sale price")
ggplotly(Ah)
#Description The plot shows sale price against lot frontage. Many houses are recorded with a lot frontage of 0, which is why a column of points appears at 0 across a wide range of sale prices.
ggplot(ames_numeric)+
geom_point(mapping = aes(x=Lot_Area,y=Sale_Price)) +
labs(title="Sale price according to lot area",x="Lot area",y="Sale price")
#Description The plot above shows sale price against lot area. Most lots are smaller than about 12000 square feet (the 75th percentile in the skim table is 11555), with a small number of much larger lots.
ggplot(ames_numeric) + geom_point(mapping=aes(Year_Built,Sale_Price ,color = Year_Sold ))+
labs(title="Sale price across all the built year ",x="Year built",y="Sale price")
#Description The plot shows sale price against year built, with the color indicating the year sold. Houses built around 1980 tend to have higher sale prices than those built around 1900 or 1940.
plotvar <- ames_numeric$Sale_Price # pick a variable to plot
nclr <- 8 # number of colors
plotclr <- brewer.pal(nclr,"PuBu") # get the colors
colornum <- cut(rank(plotvar), nclr, labels=FALSE)
colcode <- plotclr[colornum] # assign color
# scatter plot
plot.angle <- 45
scatterplot3d(ames_numeric$Lot_Frontage, ames_numeric$Gr_Liv_Area, plotvar, type="h", angle=plot.angle, color=colcode, pch=20, cex.symbols=2,
col.axis="gray", col.grid="gray")
ames_numeric %>%
group_by(Year_Sold) %>%
summarize(Mean_Sales_Price = mean(Sale_Price))
## # A tibble: 5 x 2
## Year_Sold Mean_Sales_Price
## <int> <dbl>
## 1 2006 181762.
## 2 2007 185138.
## 3 2008 178842.
## 4 2009 181405.
## 5 2010 172598.
ames_numeric %>%
group_by(Year_Sold) %>% summarize(Mean_Garage_cars = mean(Garage_Cars))
## # A tibble: 5 x 2
## Year_Sold Mean_Garage_cars
## <int> <dbl>
## 1 2006 1.77
## 2 2007 1.80
## 3 2008 1.73
## 4 2009 1.81
## 5 2010 1.67
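Sale_Price is the response variable in the ames_numeric dataset. Its distribution is right-skewed rather than normal (in the skim table the mean, 180796.06, is well above the median, 160000), so we transform it with the log10 function.
Using Sale_Price as the response variable: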
ggplot(ames_numeric)+geom_histogram(aes(Sale_Price),fill="orange", color="black")+
labs(title="Response variable without transformation",x="Sale price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(ames_numeric)+geom_histogram(aes(log10(Sale_Price)),fill="orange", color="black")+
labs(title="Response variable with transformation",x="log10(Sale price)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
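To put a number on how much the transformation helps, sample skewness can be compared before and after taking log10. The sketch below is self-contained and uses simulated right-skewed prices, not the Ames data. `skewness()` is a small helper defined here, and the simulation parameters are arbitrary assumptions. With the report's objects, the same helper could be called on ames_numeric$Sale_Price.

```r
# Sample skewness: third central moment divided by sd^3 (about 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulated stand-in for sale prices (log-normal, hence right-skewed); NOT the Ames data
set.seed(38)
price <- rlnorm(2930, meanlog = log(160000), sdlog = 0.4)

skew_raw <- skewness(price)         # clearly positive: long right tail
skew_log <- skewness(log10(price))  # close to 0 after the transformation
print(c(raw = skew_raw, log10 = skew_log))
```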
m1 <- lm(Sale_Price ~ Garage_Area, data = ames_numeric)
summary(m1)
##
## Call:
## lm(formula = Sale_Price ~ Garage_Area, data = ames_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -284053 -33609 -5318 25286 488808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68470.351 2737.269 25.01 <2e-16 ***
## Garage_Area 237.647 5.271 45.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61380 on 2928 degrees of freedom
## Multiple R-squared: 0.4098, Adjusted R-squared: 0.4096
## F-statistic: 2033 on 1 and 2928 DF, p-value: < 2.2e-16
Equation_Model_1<-extract_eq(m1, use_coefs = TRUE)
Equation_Model_1
\[ \operatorname{\widehat{Sale\_Price}} = 68470.35 + 237.65(\operatorname{Garage\_Area}) \]
m2 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_numeric)
summary(m2)
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483467 -30219 -1966 22728 334323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13289.634 3269.703 4.064 4.94e-05 ***
## Gr_Liv_Area 111.694 2.066 54.061 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
## F-statistic: 2923 on 1 and 2928 DF, p-value: < 2.2e-16
Equation_Model_2<-extract_eq(m2, use_coefs = TRUE)
Equation_Model_2
\[ \operatorname{\widehat{Sale\_Price}} = 13289.63 + 111.69(\operatorname{Gr\_Liv\_Area}) \]
m3 <- lm(Sale_Price ~ Lot_Area, data = ames_numeric)
summary(m3)
##
## Call:
## lm(formula = Sale_Price ~ Lot_Area, data = ames_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -369375 -47827 -18982 31261 549409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.534e+05 2.320e+03 66.11 <2e-16 ***
## Lot_Area 2.702e+00 1.806e-01 14.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77010 on 2928 degrees of freedom
## Multiple R-squared: 0.07105, Adjusted R-squared: 0.07073
## F-statistic: 223.9 on 1 and 2928 DF, p-value: < 2.2e-16
Equation_Model_3<-extract_eq(m3, use_coefs = TRUE)
Equation_Model_3
\[ \operatorname{\widehat{Sale\_Price}} = 153373.89 + 2.7(\operatorname{Lot\_Area}) \]
library(modelsummary)
models <- list(
"m1" = lm(Sale_Price ~ Garage_Area, data = ames_numeric),
"m2" = lm(Sale_Price ~ Gr_Liv_Area, data = ames_numeric),
"m3" = lm(Sale_Price ~ Lot_Area, data = ames_numeric)
)
modelsummary(models)
| | m1 | m2 | m3 |
|---|---|---|---|
| (Intercept) | 68470.351 | 13289.634 | 153373.893 |
| | (2737.269) | (3269.703) | (2319.909) |
| Garage_Area | 237.647 | | |
| | (5.271) | | |
| Gr_Liv_Area | | 111.694 | |
| | | (2.066) | |
| Lot_Area | | | 2.702 |
| | | | (0.181) |
| Num.Obs. | 2930 | 2930 | 2930 |
| R2 | 0.410 | 0.500 | 0.071 |
| R2 Adj. | 0.410 | 0.499 | 0.071 |
| AIC | 72924.9 | 72441.6 | 74253.9 |
| BIC | 72942.9 | 72459.5 | 74271.8 |
| Log.Lik. | -36459.470 | -36217.791 | -37123.929 |
| F | 2032.837 | 2922.592 | 223.941 |
#Description The metric we used to compare the three models is R2, which measures how well the regression model fits the observed data; a higher value indicates a better fit. In our case model 2 (m2) has the highest R2, 0.5. This means that about 50% of the variance in sale price is explained by above-grade living area.
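For a regression with a single predictor, R2 and the F statistic carry the same information: R2 = F / (F + residual df). As a consistency check, the sketch below recomputes R2 from the F values in the table and the 2928 residual degrees of freedom reported by summary(). Only the variable names are our own.

```r
# F statistics from the modelsummary table; residual df from summary()
f_stat   <- c(m1 = 2032.837, m2 = 2922.592, m3 = 223.941)
df_resid <- 2928

# Simple linear regression: R2 = F / (F + residual df)
r2 <- f_stat / (f_stat + df_resid)
print(round(r2, 3))

# Name of the model that explains the largest share of variance
best <- names(which.max(r2))
print(best)
```

The recomputed values agree with the R2 row of the table (0.410, 0.500, 0.071), and m2 comes out on top.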
model_diagnostics<-augment(m2)
model_diagnostics<- model_diagnostics %>% round(6)
model_diagnostics
## # A tibble: 2,930 x 8
## Sale_Price Gr_Liv_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 1656 198255. 16745. 0.000374 56533. 0.000016 0.296
## 2 105000 896 113367. -8367. 0.000828 56534. 0.000009 -0.148
## 3 172000 1329 161731. 10269. 0.00038 56534. 0.000006 0.182
## 4 244000 2110 248964. -4964. 0.000839 56534. 0.000003 -0.0879
## 5 189900 1629 195239. -5339. 0.000364 56534. 0.000002 -0.0945
## 6 195500 1604 192447. 3053. 0.000356 56534. 0.000001 0.0540
## 7 213500 1338 162736. 50764. 0.000376 56526. 0.000152 0.898
## 8 191500 1280 156258. 35242. 0.000406 56530. 0.000079 0.624
## 9 236500 1616 193787. 42713. 0.000359 56528. 0.000103 0.756
## 10 189000 1804 214786. -25786. 0.000465 56532. 0.000048 -0.456
## # ... with 2,920 more rows
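The .fitted and .resid columns follow directly from the m2 equation. As a check, the sketch below reproduces the first row of the table by hand, using the coefficients from summary(m2). The variable names are ours.

```r
# Coefficients from summary(m2)
intercept <- 13289.634
slope     <- 111.694

# First row of model_diagnostics
gr_liv_area <- 1656
sale_price  <- 215000

fitted_1 <- intercept + slope * gr_liv_area  # predicted sale price
resid_1  <- sale_price - fitted_1            # observed minus predicted
print(round(c(fitted = fitted_1, resid = resid_1)))
```

Both values match the first row above (198255 and 16745) to the nearest dollar.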
ggplot(model_diagnostics, aes(Gr_Liv_Area,Sale_Price)) + geom_point() + geom_line(aes(Gr_Liv_Area,.fitted),color="blue") + geom_segment(data = model_diagnostics %>% slice_sample(n = 30),aes(x=Gr_Liv_Area,y=Sale_Price,xend=Gr_Liv_Area, yend=.fitted), color="red")+
labs(title="Best fit model showing predicted and observed sale price along with residuals",x="Graded living area",y="Sale Price")
1)Linearity of data.
ggplot(data = model_diagnostics, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
#Description From the residuals-versus-fitted plot above, we conclude that the relationship is roughly linear.
2)normality of residuals.
ggplot(data = model_diagnostics, aes(x = .resid)) +
geom_histogram() +
xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Description We can see that the residuals are roughly symmetric and approximately normally distributed.
3)Homogeneity of residuals variance.
4)Independence of residuals error terms.
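No code is shown for checks 3 and 4. A minimal sketch of how they are commonly done follows, using base R only. It fits a model to simulated data (a stand-in, not the Ames data; all numbers in the simulation are arbitrary assumptions), and `durbin_watson()` is a helper defined here rather than a function from any package used in this report. With the report's objects, `fit` would be replaced by m2.

```r
# Simulated stand-in for Gr_Liv_Area and Sale_Price; NOT the Ames data
set.seed(38)
area  <- runif(200, min = 400, max = 3000)
price <- 13000 + 112 * area + rnorm(200, sd = 50000)
fit   <- lm(price ~ area)

# 3) Homogeneity of variance: scale-location plot.
#    A roughly flat, evenly spread band suggests constant residual variance.
root_abs_std_resid <- sqrt(abs(rstandard(fit)))
plot(fitted(fit), root_abs_std_resid,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")

# 4) Independence: Durbin-Watson statistic, sum(diff(e)^2) / sum(e^2).
#    Values near 2 suggest no first-order autocorrelation; the statistic
#    lies between 0 and 4 and depends on the order of the rows.
durbin_watson <- function(model) {
  e <- residuals(model)
  sum(diff(e)^2) / sum(e^2)
}
dw <- durbin_watson(fit)
print(dw)
```

Because the Durbin-Watson statistic depends on row order, it is most meaningful when the rows have a natural sequence, such as sale date.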
ames_numeric_transformed <- ames_numeric %>% mutate(Sale_Price_log10 = log10(Sale_Price))
transformed_model<- lm(Sale_Price_log10 ~ Gr_Liv_Area, data = ames_numeric_transformed)
summary(transformed_model)
##
## Call:
## lm(formula = Sale_Price_log10 ~ Gr_Liv_Area, data = ames_numeric_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.02587 -0.06577 0.01342 0.07202 0.39231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.855e+00 7.355e-03 660.12 <2e-16 ***
## Gr_Liv_Area 2.437e-04 4.648e-06 52.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1271 on 2928 degrees of freedom
## Multiple R-squared: 0.4842, Adjusted R-squared: 0.484
## F-statistic: 2749 on 1 and 2928 DF, p-value: < 2.2e-16
ames_numeric_transformed <- augment(transformed_model)
ames_numeric_transformed
## # A tibble: 2,930 x 8
## Sale_Price_log10 Gr_Liv_Area .fitted .resid .hat .sigma .cooksd
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5.33 1656 5.26 0.0737 0.000374 0.127 0.0000629
## 2 5.02 896 5.07 -0.0524 0.000828 0.127 0.0000703
## 3 5.24 1329 5.18 0.0565 0.000380 0.127 0.0000375
## 4 5.39 2110 5.37 0.0180 0.000839 0.127 0.00000845
## 5 5.28 1629 5.25 0.0264 0.000364 0.127 0.00000783
## 6 5.29 1604 5.25 0.0451 0.000356 0.127 0.0000224
## 7 5.33 1338 5.18 0.148 0.000376 0.127 0.000256
## 8 5.28 1280 5.17 0.115 0.000406 0.127 0.000166
## 9 5.37 1616 5.25 0.125 0.000359 0.127 0.000173
## 10 5.28 1804 5.29 -0.0183 0.000465 0.127 0.00000484
## # ... with 2,920 more rows, and 1 more variable: .std.resid <dbl>
Equation_Model_transformed<-extract_eq(transformed_model, use_coefs = TRUE)
Equation_Model_transformed
\[ \operatorname{\widehat{Sale\_Price\_log10}} = 4.86 + 0(\operatorname{Gr\_Liv\_Area}) \]
ggplot(ames_numeric_transformed, aes(Gr_Liv_Area,Sale_Price_log10))+ geom_point()+geom_smooth(method=lm, se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
The intercept of the fitted equation is 4.86, so when above-grade living area is 0 the model predicts an average log10 sale price of 4.86. The slope is printed as 0 in the equation only because of rounding; the model summary gives 2.437e-04 per square foot.
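Because the response is log10(Sale_Price), predictions have to be back-transformed with 10^x to be read on the original price scale, and the slope acts multiplicatively. The sketch below uses the coefficients from summary(transformed_model), where the slope is 2.437e-04. `predict_price()` is a hypothetical helper introduced only for illustration.

```r
# Coefficients from summary(transformed_model)
intercept <- 4.855
slope     <- 2.437e-04

# Back-transform a log10 prediction to the original price scale
predict_price <- function(gr_liv_area) 10^(intercept + slope * gr_liv_area)

price_at_0    <- predict_price(0)     # price implied by the intercept alone
price_at_1500 <- predict_price(1500)  # a house near the median living area

# Percentage change in predicted price per additional 100 square feet
pct_per_100 <- (10^(slope * 100) - 1) * 100
print(c(at_0 = price_at_0, at_1500 = price_at_1500, pct_per_100 = pct_per_100))
```

Under these coefficients, each additional 100 square feet raises the predicted price by a bit under 6%.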
1)Linearity of data.
ggplot(data = ames_numeric_transformed, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
#Description From the residuals-versus-fitted plot above, we conclude that the relationship is roughly linear.
2)normality of residuals.
ggplot(data = ames_numeric_transformed, aes(x = .resid)) +
geom_histogram() +
xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Description We can see that the residuals are roughly symmetric and approximately normally distributed.
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
## [5] LC_TIME=English_India.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] modelsummary_0.8.1 RColorBrewer_1.1-2 scatterplot3d_0.3-41
## [4] plotly_4.9.4.1 ggvis_0.4.7 shinyjs_2.0.0
## [7] shiny_1.6.0 equatiomatic_0.2.0 broom_0.7.8
## [10] skimr_2.1.3 modeldata_0.1.1 forcats_0.5.1
## [13] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [16] readr_1.4.0 tidyr_1.1.3 tibble_3.1.2
## [19] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-152 fs_1.5.0 lubridate_1.7.10 webshot_0.5.2
## [5] httr_1.4.2 repr_1.1.3 tools_4.0.5 backports_1.2.1
## [9] bslib_0.2.5.1 utf8_1.2.1 R6_2.5.0 DBI_1.1.1
## [13] lazyeval_0.2.2 mgcv_1.8-34 colorspace_2.0-2 withr_2.4.2
## [17] tidyselect_1.1.1 compiler_4.0.5 cli_2.5.0 rvest_1.0.0
## [21] xml2_1.3.2 labeling_0.4.2 sass_0.4.0 checkmate_2.0.0
## [25] scales_1.1.1 tables_0.9.6 systemfonts_1.0.2 digest_0.6.27
## [29] svglite_2.0.0 rmarkdown_2.9 base64enc_0.1-3 pkgconfig_2.0.3
## [33] htmltools_0.5.1.1 dbplyr_2.1.1 fastmap_1.1.0 highr_0.9
## [37] htmlwidgets_1.5.3 rlang_0.4.11 readxl_1.3.1 rstudioapi_0.13
## [41] jquerylib_0.1.4 generics_0.1.0 farver_2.1.0 jsonlite_1.7.2
## [45] crosstalk_1.1.1 magrittr_2.0.1 kableExtra_1.3.4 Matrix_1.3-2
## [49] Rcpp_1.0.6 munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.0
## [53] stringi_1.6.2 yaml_2.2.1 grid_4.0.5 promises_1.2.0.1
## [57] crayon_1.4.1 lattice_0.20-41 haven_2.4.1 splines_4.0.5
## [61] hms_1.1.0 knitr_1.33 pillar_1.6.1 reprex_2.0.0
## [65] glue_1.4.2 evaluate_0.14 data.table_1.14.0 modelr_0.1.8
## [69] vctrs_0.3.8 httpuv_1.6.1 cellranger_1.1.0 gtable_0.3.0
## [73] assertthat_0.2.1 xfun_0.24 mime_0.11 xtable_1.8-4
## [77] later_1.2.0 viridisLite_0.4.0 ellipsis_0.3.2